{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Validating and Importing Item Metadata <a class=\"anchor\" id=\"top\"></a>\n",
    "\n",
    "In this notebook, you will pick up where you left off in `01_Validating_and_Importing_User_Item_Interaction_Data.ipynb` to build a working item metadata dataset. This will allow you to work with filters as well as later support the `User Personalization` or `HRNN-Metadata` algorithms.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](img/02.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this notebook, you need to have run the previous notebook, `01_Validating_and_Importing_User_Item_Interaction_Data`, where you created a dataset and imported interaction data into Amazon Personalize. At the end of that notebook, you saved some of the variable values, which you now need to load into this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store -r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare your Item metadata <a class=\"anchor\" id=\"prepare\"></a>\n",
    "[Back to top](#top)\n",
    "\n",
    "The next thing to be done is to load the data and confirm the data is in a good state, then save it to a CSV where it is ready to be used with Amazon Personalize.\n",
    "\n",
    "To get started, import a collection of Python libraries commonly used in data science."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from time import sleep\n",
    "import json\n",
    "from datetime import datetime\n",
    "import boto3\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next,open the data file and take a look at the first several rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Grumpier Old Men (1995)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId                               title  \\\n",
       "0        1                    Toy Story (1995)   \n",
       "1        2                      Jumanji (1995)   \n",
       "2        3             Grumpier Old Men (1995)   \n",
       "3        4            Waiting to Exhale (1995)   \n",
       "4        5  Father of the Bride Part II (1995)   \n",
       "\n",
       "                                        genres  \n",
       "0  Adventure|Animation|Children|Comedy|Fantasy  \n",
       "1                   Adventure|Children|Fantasy  \n",
       "2                               Comedy|Romance  \n",
       "3                         Comedy|Drama|Romance  \n",
       "4                                       Comedy  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data = pd.read_csv(dataset_dir + '/movies.csv')\n",
    "original_data.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>9742.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>42200.353623</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>52160.494854</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>3248.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>7300.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>76232.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>193609.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             movieId\n",
       "count    9742.000000\n",
       "mean    42200.353623\n",
       "std     52160.494854\n",
       "min         1.000000\n",
       "25%      3248.250000\n",
       "50%      7300.000000\n",
       "75%     76232.000000\n",
       "max    193609.000000"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This does not really tell us much about the dataset, so we will explore a bit more for just raw info. We can see that genres are often grouped together, and that is fine for us as Personalize does support this structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 9742 entries, 0 to 9741\n",
      "Data columns (total 3 columns):\n",
      " #   Column   Non-Null Count  Dtype \n",
      "---  ------   --------------  ----- \n",
      " 0   movieId  9742 non-null   int64 \n",
      " 1   title    9742 non-null   object\n",
      " 2   genres   9742 non-null   object\n",
      "dtypes: int64(1), object(2)\n",
      "memory usage: 228.5+ KB\n"
     ]
    }
   ],
   "source": [
    "original_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this, you can see that there are a total of (62,000+ for full 9742 for small) entries in the dataset, with 3 columns.\n",
    "\n",
    "This is a pretty minimal dataset of just the movieId, title and the list of genres that are applicable to each entry. However there is additional data available in the Movielens dataset. For instance the title includes the year of the movies release. Let's make that another column of metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Grumpier Old Men (1995)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId                               title  \\\n",
       "0        1                    Toy Story (1995)   \n",
       "1        2                      Jumanji (1995)   \n",
       "2        3             Grumpier Old Men (1995)   \n",
       "3        4            Waiting to Exhale (1995)   \n",
       "4        5  Father of the Bride Part II (1995)   \n",
       "\n",
       "                                        genres  year  \n",
       "0  Adventure|Animation|Children|Comedy|Fantasy  1995  \n",
       "1                   Adventure|Children|Fantasy  1995  \n",
       "2                               Comedy|Romance  1995  \n",
       "3                         Comedy|Drama|Romance  1995  \n",
       "4                                       Comedy  1995  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data['year'] =original_data['title'].str.extract('.*\\((.*)\\).*',expand = False)\n",
    "original_data.head(5)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "SyntaxError",
     "evalue": "can't assign to function call (<ipython-input-7-b6a5921fcbc9>, line 1)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;36m  File \u001b[0;32m\"<ipython-input-7-b6a5921fcbc9>\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m    original_data['CreationTimeStamp'] = datetime(2015, 10, 19),expand = False\u001b[0m\n\u001b[0m                                        ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m can't assign to function call\n"
     ]
    }
   ],
   "source": [
    "# original_data['CreationTimeStamp'] = datetime(2015, 10, 19),expand = False\n",
    "# original_data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From an item metadata perspective, we only want to include information that is relevant to training a model and/or filtering resulte, so we will drop the title, retaining the genre information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>genres</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId                                       genres  year\n",
       "0        1  Adventure|Animation|Children|Comedy|Fantasy  1995\n",
       "1        2                   Adventure|Children|Fantasy  1995\n",
       "2        3                               Comedy|Romance  1995\n",
       "3        4                         Comedy|Drama|Romance  1995\n",
       "4        5                                       Comedy  1995"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "itemmetadata_df = original_data.copy()\n",
    "itemmetadata_df = itemmetadata_df[['movieId', 'genres', 'year']]\n",
    "itemmetadata_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After manipulating the data, always confirm if the data format has changed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "movieId     int64\n",
       "genres     object\n",
       "year       object\n",
       "dtype: object"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "itemmetadata_df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Amazon Personalize has a default column for `ITEM_ID` that will map to our `movieId`, and now we can flesh out more information by specifying `GENRE` as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_df.rename(columns = {'genres':'GENRE', 'movieId':'ITEM_ID', 'year':'YEAR'}, inplace = True) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! At this point the data is ready to go, and we just need to save it as a CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_filename = \"item-meta.csv\"\n",
    "itemmetadata_df.to_csv((data_dir+\"/\"+itemmetadata_filename), index=False, float_format='%.0f')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configure the SDK to Personalize:\n",
    "personalize = boto3.client('personalize')\n",
    "personalize_runtime = boto3.client('personalize-runtime')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the dataset\n",
    "\n",
    "First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).\n",
    "\n",
    "Here, you will create a schema for item metadata data, which needs the `ITEM_ID` and `GENRE` fields. These must be defined in the same order in the schema as they appear in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"schemaArn\": \"arn:aws:personalize:us-east-1:835319576252:schema/personalize-poc-movielens-item\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"7fdbc984-a921-40af-a92c-feacb171b778\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 15 Sep 2020 02:10:10 GMT\",\n",
      "      \"x-amzn-requestid\": \"7fdbc984-a921-40af-a92c-feacb171b778\",\n",
      "      \"content-length\": \"96\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "itemmetadata_schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Items\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"GENRE\",\n",
    "            \"type\": \"string\",\n",
    "            \"categorical\": True\n",
    "        },{\n",
    "            \"name\": \"YEAR\",\n",
    "            \"type\": \"int\",\n",
    "        },\n",
    "        \n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "create_schema_response = personalize.create_schema(\n",
    "    name = \"personalize-poc-movielens-item\",\n",
    "    schema = json.dumps(itemmetadata_schema)\n",
    ")\n",
    "\n",
    "itemmetadataschema_arn = create_schema_response['schemaArn']\n",
    "print(json.dumps(create_schema_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a schema created, you can create a dataset within the dataset group. Note, this does not load the data yet. This will happen a few steps later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetArn\": \"arn:aws:personalize:us-east-1:835319576252:dataset/personalize-poc-movielens25m/ITEMS\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"42ff4981-d1b9-44a9-a04b-9aae006b4c6e\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 15 Sep 2020 02:10:10 GMT\",\n",
      "      \"x-amzn-requestid\": \"42ff4981-d1b9-44a9-a04b-9aae006b4c6e\",\n",
      "      \"content-length\": \"102\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "dataset_type = \"ITEMS\"\n",
    "create_dataset_response = personalize.create_dataset(\n",
    "    name = \"personalize-poc-movielens-items\",\n",
    "    datasetType = dataset_type,\n",
    "    datasetGroupArn = dataset_group_arn,\n",
    "    schemaArn = itemmetadataschema_arn\n",
    ")\n",
    "\n",
    "items_dataset_arn = create_dataset_response['datasetArn']\n",
    "print(json.dumps(create_dataset_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Upload data to S3\n",
    "\n",
    "Now that your Amazon S3 bucket has been created, upload the CSV file of our user-item-interaction data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_file_path = data_dir + \"/\" + itemmetadata_filename\n",
    "boto3.Session().resource('s3').Bucket(bucket_name).Object(itemmetadata_filename).upload_file(itemmetadata_file_path)\n",
    "interactions_s3DataPath = \"s3://\"+bucket_name+\"/\"+itemmetadata_filename"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import the item metadata <a class=\"anchor\" id=\"import\"></a>\n",
    "[Back to top](#top)\n",
    "\n",
    "Earlier you created the dataset group and dataset to house your information, so now you will execute an import job that will load the data from the S3 bucket into the Amazon Personalize dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:835319576252:dataset-import-job/personalize-poc-item-import1\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"955a9a4c-21d2-4b03-aa3d-ba5cff3073a9\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 15 Sep 2020 02:10:12 GMT\",\n",
      "      \"x-amzn-requestid\": \"955a9a4c-21d2-4b03-aa3d-ba5cff3073a9\",\n",
      "      \"content-length\": \"116\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"personalize-poc-item-import1\",\n",
    "    datasetArn = items_dataset_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket_name, itemmetadata_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the import job every second, up to a maximum of 6 hours.\n",
    "\n",
    "Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import job should take around 15 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatasetImportJob: CREATE PENDING\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: CREATE IN_PROGRESS\n",
      "DatasetImportJob: ACTIVE\n",
      "CPU times: user 83.1 ms, sys: 16.8 ms, total: 100 ms\n",
      "Wall time: 17min 1s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "max_time = time.time() + 6*60*60 # 6 hours\n",
    "while time.time() < max_time:\n",
    "    describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n",
    "        datasetImportJobArn = dataset_import_job_arn\n",
    "    )\n",
    "    status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n",
    "    print(\"DatasetImportJob: {}\".format(status))\n",
    "    \n",
    "    if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n",
    "        break\n",
    "        \n",
    "    time.sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this import now complete you can enable filtering for your recommendations as well as support `HRNN-Metadata`. Run the cell below before moving on to store a few values for usage in the next notebooks. After completing that cell open notebook `03_Creating_and_Evaluating_Solutions.ipynb` to continue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stored 'items_dataset_arn' (str)\n",
      "Stored 'itemmetadataschema_arn' (str)\n"
     ]
    }
   ],
   "source": [
    "%store items_dataset_arn\n",
    "%store itemmetadataschema_arn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
